FastAI Lecture 03
Rectified linear unit (ReLU): y = max(0, mx + b)
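As a minimal sketch, a rectified linear function can be written as a line mx + b whose negative outputs are clipped to zero. The name rectified_linear and the sample values of m, b and x are chosen here for illustration only.

```python
import torch

def rectified_linear(m, b, x):
    # Compute the line y = m*x + b, then replace negative outputs with zero.
    y = m * x + b
    return torch.clamp(y, min=0.0)

# Illustrative inputs (assumed values, not from the lecture).
x = torch.tensor([-2.0, 0.0, 1.0, 3.0])
out = rectified_linear(1.0, 1.0, x)
print(out)
```

With m = 1 and b = 1 the line gives -1, 1, 2, 4 for these inputs, and only the negative value is changed by the clipping.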
Calculating Loss
What are derivatives?
A derivative gives the rate of change of a function at a particular value of its parameter. In machine learning, the key is to know how to change the parameters (weights) of a function to reduce the loss. Derivatives tell us how the loss would change if we altered the weights, and calculus lets us compute them to build the gradients of the function. - fastbook
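To make "rate of change at a point" concrete, here is a small sketch that estimates a derivative numerically: nudge the input by a tiny amount h and see how much the output moves. The function f, the point x = 3 and the step h are assumptions picked for illustration.

```python
def f(x):
    return x ** 2

def approx_derivative(func, x, h=1e-6):
    # Forward difference: change in output divided by change in input.
    return (func(x + h) - func(x)) / h

slope = approx_derivative(f, 3.0)
print(slope)  # close to 6, the exact derivative 2*x at x = 3
```

This is only an approximation; automatic differentiation in PyTorch computes the exact derivative instead of relying on a small step.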
Calculating derivatives for weights in NN
For neural networks with lots of weights, we find the derivative with respect to each weight, treating the others as constants. In deep learning, “gradients” usually means the values of a function’s derivatives at particular parameter values. Calling requires_grad_() on a tensor tells PyTorch to track the operations applied to it so that these derivatives can be calculated automatically.
from torch import tensor
def f(x): return x**2
# Create a tensor and ask PyTorch to track gradients for it
xt = tensor(3.).requires_grad_()
# Calculating the function with that value
yt = f(xt)
yt
>> tensor(9., grad_fn=<PowBackward0>)
# Asking PyTorch to calculate the gradient for us.
# "backward" here refers to backpropagation, which is the name given to the process of calculating the derivative of each layer.
yt.backward()
xt.grad
>> tensor(6.)
The derivative of f(x) = x^2 is 2x, which is 6 at x = 3. xt.grad (the gradient) gives the same value.
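The same mechanism extends to several weights at once. The sketch below assumes a vector of three made-up weights and a function that sums their squares; PyTorch returns one partial derivative per weight, each computed as if the other weights were constants.

```python
import torch

def f(w):
    # Sum of squares, so the partial derivative for each weight is 2 * weight.
    return (w ** 2).sum()

# Three illustrative weights (assumed values).
w = torch.tensor([3.0, 4.0, 10.0], requires_grad=True)
loss = f(w)
loss.backward()
print(w.grad)  # one gradient entry per weight
```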
The gradients only tell us the slope of our function, they don’t actually tell us exactly how far to adjust the parameters. But it gives us some idea of how far; if the slope is very large, then that may suggest that we have more adjustments to do, whereas if the slope is very small, that may suggest that we are close to the optimal value. - fastbook
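A quick way to see this is to compare the slope far from the minimum with the slope close to it. The sketch below reuses f(x) = x**2, whose minimum is at 0; the helper slope_at and the two sample points are assumptions for illustration.

```python
import torch

def f(x):
    return x ** 2

def slope_at(value):
    # Build a fresh tracked tensor, run backward, and read off the gradient.
    x = torch.tensor(value, requires_grad=True)
    f(x).backward()
    return x.grad.item()

far = slope_at(10.0)   # far from the minimum at 0
near = slope_at(0.1)   # close to the minimum
print(far, near)
```

The slope is much larger at the distant point, which hints that a bigger adjustment is needed there, but neither number says exactly how big the step should be.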
Loss vs Metric
Aspect | Metric | Loss |
---|---|---|
Purpose Difference | Drives human understanding of performance | Drives automated learning by optimization |
Smoothness Requirement | Not constrained by smoothness | Requires smoothness for meaningful derivatives |
Optimization vs. Real Goal | Reflects actual goals | Compromise between real goals and optimization |
Calculation Process | Provides overall model evaluation | Calculated per item, averaged at epoch end |
Focus Consideration | Primary focus for judging performance | Important for automated learning, may not directly represent end goal |
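The smoothness row can be illustrated with a toy example. The sketch below assumes a one-weight sigmoid model, four made-up data points, accuracy as the metric, and mean absolute error as the loss. A small change in the weight leaves accuracy unchanged, so accuracy gives no signal for optimization, while the loss moves slightly.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Made-up inputs and binary labels, for illustration only.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def predictions(w):
    return [sigmoid(w * x) for x in xs]

def accuracy(w):
    # Metric: fraction of items on the correct side of the 0.5 threshold.
    return sum((p > 0.5) == bool(y) for p, y in zip(predictions(w), ys)) / len(ys)

def loss(w):
    # Loss: mean distance between each prediction and its label.
    return sum(abs(p - y) for p, y in zip(predictions(w), ys)) / len(ys)

acc_a, acc_b = accuracy(1.0), accuracy(1.1)
loss_a, loss_b = loss(1.0), loss(1.1)
print(acc_a, acc_b, loss_a, loss_b)
```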
Why Batches?
Once the loss has been calculated, when should the system update the weights? If the loss is calculated for a single item, it is not very informative: the resulting gradient is imprecise and unstable. If the loss is calculated for the entire dataset, each update takes a very long time.
Mini Batch
So we compute the average loss over a few data items at a time (a mini-batch). Batch size = number of items in each mini-batch.
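A rough sketch of this idea follows, assuming a one-weight model w * x, mean squared error as the loss, and made-up data whose true weight is 2. The learning rate, batch size and number of epochs are illustrative choices, not values from the lecture. The weight is updated once per mini-batch using the average loss over that batch.

```python
import torch

# Made-up dataset: targets are exactly 2 * x.
xs = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
ys = 2.0 * xs

w = torch.tensor(0.0, requires_grad=True)
lr, batch_size = 0.01, 2  # assumed hyperparameters

for epoch in range(50):
    for i in range(0, len(xs), batch_size):
        xb, yb = xs[i:i + batch_size], ys[i:i + batch_size]
        # Forward pass and average loss over the mini-batch.
        loss = ((w * xb - yb) ** 2).mean()
        # Backward pass: gradient of the batch loss with respect to w.
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad  # gradient descent step
        w.grad.zero_()        # reset before the next mini-batch

print(w.item())  # approaches 2
```

The batches here are taken in a fixed order to keep the sketch short; in practice they would normally be shuffled each epoch.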
Batch size | Gradient quality | Processing time | Mini-batches per epoch |
---|---|---|---|
Larger | More accurate and stable estimate of your dataset’s gradients from the loss function | Longer | Fewer |
NOTE: GPU memory limits how large a batch size we can use.
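The relationship between batch size and the number of mini-batches per epoch is simple arithmetic. The sketch below assumes the final, smaller batch is kept rather than dropped; the helper name batches_per_epoch and the item counts are illustrative.

```python
import math

def batches_per_epoch(n_items, batch_size):
    # Round up so a final partial batch still counts as one mini-batch.
    return math.ceil(n_items / batch_size)

small = batches_per_epoch(26, 6)    # 4 full batches plus 1 partial batch
large = batches_per_epoch(26, 13)   # 2 full batches
print(small, large)
```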
Randomization with mini batches
A dataset is a collection of (input, label) tuples. In both PyTorch and fastai it is passed to a DataLoader, which can shuffle the items and group them into random mini-batches.
import string
from fastcore.foundation import L
from fastai.data.load import DataLoader
# A toy dataset of (index, letter) tuples
ds = L(enumerate(string.ascii_lowercase))
ds
>> (#26) [(0, 'a'),(1, 'b'),(2, 'c'),(3, 'd'),(4, 'e'),(5, 'f'),(6, 'g'),(7, 'h'),(8, 'i'),(9, 'j')...]
# Shuffle the items and group them into mini-batches of 6
dl = DataLoader(ds, batch_size=6, shuffle=True)
list(dl)
>> [(tensor([17, 18, 10, 22, 8, 14]), ('r', 's', 'k', 'w', 'i', 'o')),
 (tensor([20, 15, 9, 13, 21, 12]), ('u', 'p', 'j', 'n', 'v', 'm')),
 (tensor([7, 25, 6, 5, 11, 23]), ('h', 'z', 'g', 'f', 'l', 'x')),
 (tensor([ 1, 3, 0, 24, 19, 16]), ('b', 'd', 'a', 'y', 't', 'q')),
 (tensor([ 2, 4]), ('c', 'e'))]
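For comparison, here is a sketch of the same experiment using plain PyTorch's torch.utils.data.DataLoader instead of the fastai one, with an ordinary Python list standing in for L. Because the order is random, the code checks batch sizes and coverage rather than specific contents.

```python
import string
from torch.utils.data import DataLoader

# Toy dataset of (index, letter) tuples, as an ordinary list.
ds = list(enumerate(string.ascii_lowercase))
dl = DataLoader(ds, batch_size=6, shuffle=True)

batches = list(dl)
# Each batch pairs a tensor of indices with the matching letters.
sizes = [len(idx) for idx, letters in batches]
seen = sorted(int(i) for idx, _ in batches for i in idx)

print(sizes)                     # four batches of 6 and a final batch of 2
print(seen == list(range(26)))   # every item appears exactly once per epoch
```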
Term | Meaning |
---|---|
ReLU | Function that returns 0 for negative numbers and doesn’t change positive numbers. |
Mini-batch | A small group of inputs and labels gathered together in two arrays. A gradient descent step is performed on this batch (rather than on a whole epoch). |
Forward pass | Applying the model to some input and computing the predictions. |
Loss | A value that represents how well (or badly) our model is doing. |
Gradient | The derivative of the loss with respect to some parameter of the model. |
Backward pass | Computing the gradients of the loss with respect to all model parameters. |
Gradient descent | Taking a step in the direction opposite to the gradients to make the model parameters a little bit better. |
Learning rate | The size of the step we take when applying SGD to update the parameters of the model. |